Reflections on the Penn Discourse TreeBank, Comparable Corpora, and Complementary Annotation
نویسندگان
چکیده
The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. Its focus on discourse relations that are either lexically grounded in explicit discourse connectives or associated with sentential adjacency has not only facilitated its use in language technology and psycholinguistics but also has spawned the annotation of comparable corpora in other languages and genres. Given this situation, this paper has four aims: (1) to provide a comprehensive introduction to the PDTB for those who are unfamiliar with it; (2) to correct some wrong (or perhaps inadvertent) assumptions about the PDTB and its annotation that may have weakened previous results or the performance of decision procedures induced from the data; (3) to explain variations seen in the annotation of comparable resources in other languages and genres, which should allow developers of future comparable resources to recognize whether the variations are relevant to them; and (4) to enumerate and explain relationships between PDTB annotation and complementary annotation of other linguistic phenomena. The paper draws on work done by ourselves and others since the corpus was released.
منابع مشابه
Towards interoperable discourse annotation. Discourse features in the Ontologies of Linguistic Annotation
This paper describes the extension of the Ontologies of Linguistic Annotation (OLiA) with respect to discourse features. The OLiA ontologies provide a a terminology repository that can be employed to facilitate the conceptual (semantic) interoperability of annotations of discourse phenomena as found in the most important corpora available to the community, including OntoNotes, the RST Discourse...
متن کاملAnnotation And Data Mining Of The Penn Discourse TreeBank
The Penn Discourse TreeBank (PDTB) is a new resource built on top of the Penn Wall Street Journal corpus, in which discourse connectives are annotated along with their arguments. Its use of standoff annotation allows integration with a stand-off version of the Penn TreeBank (syntactic structure) and PropBank (verbs and their arguments), which adds value for both linguistic discovery and discour...
متن کاملPenn Discourse Treebank: Building a Large Scale Annotated Corpus Encoding DLTAG-based Discourse Structure and Discourse Relations
Large scale annotated corpora have played a critical role in speech and natural language research. However, while existing annotated corpora such as the Penn Treebank have been highly successful at the sentence-level, we also need large-scale annotated resources that reliably encode key aspects of discourse. In this paper, we detail (1) our plans for building the Penn Discourse Treebank (PDTB),...
متن کاملFrom Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank
The present paper reports on a preparatory research for building a language corpus annotation scenario capturing the discourse relations in Czech. We primarily focus on the description of the syntactically motivated relations in discourse, basing our findings on the theoretical background of the Prague Dependency Treebank 2.0 and the Penn Discourse Treebank 2. Our aim is to revisit the present-...
متن کاملHow consistent are our discourse annotations? Insights from mapping RST-DT and PDTB annotations
Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes comparison of the annotations difficult, thereby also preventing researchers from searching the corpora in a unified way, or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for m...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Computational Linguistics
دوره 40 شماره
صفحات -
تاریخ انتشار 2014